Save your workspace in Python

A major issue for me coming to Python from Matlab was how to save my workspaces. This is especially crucial when finalizing results in support of a manuscript. It is painful to have reviewers ask for additional statistics or new analyses and then have to rerun everything to address them. Also, some analyses take a long time to run. So, how the heck does one save workspace variables into a file in Python? It turns out to be not that difficult. Several established libraries exist for this purpose. One of these libraries, dill, is very good for short-term saves. Others are better for long-term data storage and for sharing with collaborators who are still dependent on Matlab or who use R.

The code below brings in five options for saving your workspace.

By the way, this code was written on PCs running Linux Mint, an Ubuntu variant, and the Python installation was based on Continuum Analytics Anaconda Python distro.

dill is an extension of Python's pickle module that enables saving (serializing) most of the common Python datatypes. A restored session depends on the versions of Python and the libraries installed on the computer that created the dilled workspace. For me, it is the go-to library when I am working on an analysis on my office PC and need to head out and carry on working on my notebook. However, given that dependence on Python and library versions, it does not seem like a good idea for long-term data storage.
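
For a quick feel of what dill adds over plain pickle, here is a minimal sketch (not from the notebook below; the file name square.pkl is just illustrative) that round-trips a lambda, something the standard pickle module refuses to serialize:

import dill

# plain pickle raises PicklingError on lambdas; dill handles them
square = lambda x: x ** 2

with open('square.pkl', 'wb') as f:
    dill.dump(square, f)

with open('square.pkl', 'rb') as f:
    restored = dill.load(f)

print(restored(4))  # prints 16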

numpy has a nice function called savez that saves several arrays into a single file, with a companion savez_compressed for compressed output. It is fast to use, but the format is Python-centric. However, an R library called RcppCNPy was recently written that makes it easy to load and save data in this format.
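
As a quick orientation (the file and array names here are just illustrative), the keyword arguments you pass to savez become the keys in the .npz archive, and np.load gets them back:

import numpy as np

a = np.arange(10.0)
b = np.random.rand(3, 3)

# keyword names become the keys inside the .npz archive
np.savez('arrays.npz', a=a, b=b)
np.savez_compressed('arrays_compressed.npz', a=a, b=b)

# np.load returns a lazy, dict-like NpzFile
with np.load('arrays.npz') as data:
    print(data['a'], data['b'].shape)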

scipy includes functions for reading and writing Matlab version 4 and 5 files, savemat and [loadmat](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.io.loadmat.html). These are very useful, especially if you use both Python and Matlab or have collaborators stuck on Matlab.
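
The basic round trip looks like this (a minimal sketch with made-up variable and file names; the real data are loaded further down):

import numpy as np
from scipy.io import loadmat, savemat

# savemat takes a dict mapping Matlab variable names to arrays
savemat('roundtrip.mat', {'spikes': np.arange(5.0), 'fs': 1000.0})

# loadmat returns a dict; note that scalars come back as 1x1 arrays
mat = loadmat('roundtrip.mat')
print(mat['spikes'], mat['fs'])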

Perhaps the best long-term storage format is HDF5. This format is used by the most recent versions of Matlab and can be read directly into GNU Octave. Well-established libraries exist for working with HDF5 files in R and Julia. The HDF Group supplies a viewer for HDF5 files that makes it easy to check on the contents of a file without reading it into Python. I have found two Python libraries, h5py and hdf5storage, useful for working with HDF5 files in Python. h5py is fast and easy to use. hdf5storage is slower but produces compressed saves by default.

This notebook shows how to use these libraries for saving your workspace in Python. The data set is part of the demo data file provided with NeuroExplorer, written by my grad school lab colleague Alex Kirillov. NeuroExplorer is an excellent tool for working with neurophysiological data files. My lab depends on it.

The first step is to import the relevant libraries.


In [1]:
import dill
import numpy as np
from scipy.io import loadmat, savemat
import h5py
import hdf5storage

Switch folders and load the neuronal data, which were parsed out of a nex file using old Matlab code.


In [2]:
%cd ~/Desktop/Spikes-and-Fields/NEx-demo


/home/mark/Desktop/Spikes-and-Fields/NEx-demo

In [3]:
NEx_demo = loadmat('SpikesAndFields.mat')

loadmat puts the variables from the Matlab/Octave workspace into a dict.


In [4]:
Keys = NEx_demo.keys()
print(Keys)


dict_keys(['Event05', '__globals__', 'Neuron06d', 'fn', 'ans', 'ts', 'Neuron04a', '__header__', 'Neuron06b', 'Event06', 'Neuron05b', 'Neuron05c', 'Neuron07a', 'adfreq', 'AD01', '__version__', 'Event04', 'FILE'])

My work style is to put each neuron, LFP, or behavioral event into its own variable in the workspace. (A less manual alternative is sketched after the cell below.)


In [5]:
Neuron04a = NEx_demo['Neuron04a']
Neuron05b = NEx_demo['Neuron05b']
Neuron05c = NEx_demo['Neuron05c']
Neuron06b = NEx_demo['Neuron06b']
Neuron06d = NEx_demo['Neuron06d']
Neuron07a = NEx_demo['Neuron07a']
Event04 = NEx_demo['Event04']
Event05 = NEx_demo['Event05']
Event06 = NEx_demo['Event06']
ADmat = NEx_demo['AD01']  # LFP data
adfreq = NEx_demo['adfreq'] # sampling frequency
ts = NEx_demo['ts'] # ts is the temporal offset between spikes/events and fields in the Plexon recording file
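
If you would rather not type each assignment, here is a minimal sketch of a less manual alternative (an assumption on my part, not the workflow used in this notebook) that pushes every non-bookkeeping key from the loadmat dict into the interactive namespace:

# skip the __header__/__version__/__globals__ bookkeeping keys
data_vars = {k: v for k, v in NEx_demo.items() if not k.startswith('__')}
globals().update(data_vars)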

Clean up a bit.


In [6]:
%xdel NEx_demo
%xdel Keys

Display the arrays in the workspace.


In [7]:
%whos ndarray


Variable    Type       Data/Info
--------------------------------
ADmat       ndarray    1x610207: 610207 elems, type `float64`, 4881656 bytes (4.655509948730469 Mb)
Event04     ndarray    1x862: 862 elems, type `float64`, 6896 bytes
Event05     ndarray    1x752: 752 elems, type `float64`, 6016 bytes
Event06     ndarray    1x385: 385 elems, type `float64`, 3080 bytes
Neuron04a   ndarray    1x18882: 18882 elems, type `float64`, 151056 bytes (147.515625 kb)
Neuron05b   ndarray    1x13514: 13514 elems, type `float64`, 108112 bytes (105.578125 kb)
Neuron05c   ndarray    1x3053: 3053 elems, type `float64`, 24424 bytes
Neuron06b   ndarray    1x2357: 2357 elems, type `float64`, 18856 bytes
Neuron06d   ndarray    1x3807: 3807 elems, type `float64`, 30456 bytes
Neuron07a   ndarray    1x3824: 3824 elems, type `float64`, 30592 bytes
adfreq      ndarray    1x1: 1 elems, type `float64`, 8 bytes
ts          ndarray    1x1: 1 elems, type `float64`, 8 bytes

Switch to a temporary directory to evaluate saving using the various Python tools.

(I use Dropbox and SpiderOak for backups, but my temp folder is not backed up. I hate wasting bandwidth.)


In [8]:
%cd ~/temp


/home/mark/temp

dill

dill is REALLY useful for saving the entire workspace, e.g. when shutting down the notebook, heading home, and picking up work after dinner.


In [9]:
%time dill.dump_session('test.pkl')


CPU times: user 0 ns, sys: 12 ms, total: 12 ms
Wall time: 12 ms

In [10]:
ls -lstr test.pkl


5144 -rw-r--r-- 1 mark mark 5264642 Jun 29 15:56 test.pkl
  • dill is an optimal way to save intermediate files or your entire workspace
  • e.g. working in a coffee shop and wanting to save progress while coding, or saving end-of-day work and picking it up on the notebook at home after dinner

In [11]:
%reset -f
%who


Interactive namespace is empty.

In [13]:
import dill
%time dill.load_session('test.pkl')
%whos


CPU times: user 0 ns, sys: 8 ms, total: 8 ms
Wall time: 4.15 ms
Variable      Type        Data/Info
-----------------------------------
ADmat         ndarray     1x610207: 610207 elems, type `float64`, 4881656 bytes (4.655509948730469 Mb)
Event04       ndarray     1x862: 862 elems, type `float64`, 6896 bytes
Event05       ndarray     1x752: 752 elems, type `float64`, 6016 bytes
Event06       ndarray     1x385: 385 elems, type `float64`, 3080 bytes
In            list        n=10
Neuron04a     ndarray     1x18882: 18882 elems, type `float64`, 151056 bytes (147.515625 kb)
Neuron05b     ndarray     1x13514: 13514 elems, type `float64`, 108112 bytes (105.578125 kb)
Neuron05c     ndarray     1x3053: 3053 elems, type `float64`, 24424 bytes
Neuron06b     ndarray     1x2357: 2357 elems, type `float64`, 18856 bytes
Neuron06d     ndarray     1x3807: 3807 elems, type `float64`, 30456 bytes
Neuron07a     ndarray     1x3824: 3824 elems, type `float64`, 30592 bytes
Out           dict        n=0
adfreq        ndarray     1x1: 1 elems, type `float64`, 8 bytes
dill          module      <module 'dill' from '/hom<...>ckages/dill/__init__.py'>
h5py          module      <module 'h5py' from '/hom<...>ckages/h5py/__init__.py'>
hdf5storage   module      <module 'hdf5storage' fro<...>hdf5storage/__init__.py'>
loadmat       function    <function loadmat at 0x7f7ad0077158>
np            module      <module 'numpy' from '/ho<...>kages/numpy/__init__.py'>
savemat       function    <function savemat at 0x7f7ad00771e0>
ts            ndarray     1x1: 1 elems, type `float64`, 8 bytes

compare h5py, hdf5storage, np.savez, and scipy's savemat

Unlike numpy's savez and scipy's savemat, h5py needs to know the datatypes that are to be saved. For this example, ADmat, Event04, and Neuron04a are ndarrays, while adfreq and ts are 1x1 arrays holding scalars.

HDF5


In [14]:
%%time
with h5py.File('test.h5', 'w') as hf:
    hf.create_dataset('ADmat', data=ADmat, compression="gzip", shuffle=True)
    hf.create_dataset('adfreq', data=adfreq, compression="gzip", shuffle=True)
    hf.create_dataset('ts', data=ts, compression="gzip", shuffle=True)
    hf.create_dataset('Event04', data=Event04, compression="gzip", shuffle=True)
    hf.create_dataset('Neuron04a', data=Neuron04a, compression="gzip", shuffle=True)


CPU times: user 120 ms, sys: 4 ms, total: 124 ms
Wall time: 123 ms

In [15]:
ls -lstr test.h5


1588 -rw-r--r-- 1 mark mark 1626060 Jun 29 15:59 test.h5

all variables were verified using the HDF viewer; the file loads directly into Octave and Matlab
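
Reading the arrays back with h5py is just as direct; a minimal sketch (not an original notebook cell):

# open read-only and slice each dataset back into an ndarray
with h5py.File('test.h5', 'r') as hf:
    print(list(hf.keys()))
    ADmat_back = hf['ADmat'][:]
    Neuron04a_back = hf['Neuron04a'][:]

print(ADmat_back.shape, Neuron04a_back.shape)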


hdf5storage -- its default options are much slower than direct calls to h5py; however, the file is saved more efficiently, and that advantage grows with more LFP channels


In [16]:
# dict is used to set up variables for hdf5storage.writes
vars = {'ADmat':ADmat, 'adfreq':adfreq, 'ts':ts, 'Event04':Event04, 'Neuron04a':Neuron04a}

In [17]:
%%time
hdf5storage.writes(vars, filename='test_hdf5storage.h5')


CPU times: user 404 ms, sys: 12 ms, total: 416 ms
Wall time: 415 ms

In [18]:
ls -lstr test_hdf5storage.h5


1544 -rw-r--r-- 1 mark mark 1578914 Jun 29 16:03 test_hdf5storage.h5
  • hdf5storage is easy to use, handles compression without effort, and is based on a stable format (H5) that can also be read into Matlab
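
To pull a variable back out, hdf5storage provides read/reads counterparts to write/writes; a minimal sketch (assuming the path argument mirrors the keys passed to writes above):

# read one variable at a time by its path within the file
ADmat_back = hdf5storage.read(path='/ADmat', filename='test_hdf5storage.h5')
adfreq_back = hdf5storage.read(path='/adfreq', filename='test_hdf5storage.h5')
print(ADmat_back.shape, adfreq_back)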

np.savez


In [19]:
%%time
np.savez('test', ADmat=ADmat, adfreq=adfreq, ts=ts, Event04=Event04, Neuron04a=Neuron04a)


CPU times: user 20 ms, sys: 28 ms, total: 48 ms
Wall time: 61.6 ms

In [20]:
ls -lstr test.npz


4924 -rw-r--r-- 1 mark mark 5040524 Jun 29 16:05 test.npz

this format seems to be stable and standard; a library exists for reading it into R (RcppCNPy: https://cran.r-project.org/web/packages/RcppCNPy/vignettes/RcppCNPy-intro.pdf); this would be useful for passing intermediate files between R and Python, and it could serve as a long-term option for Python-only work; however, in my testing the compressed option saves no more than about 8%
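
Loading the archive back is lazy; a minimal sketch (not an original cell):

# np.load on an .npz returns a dict-like NpzFile; only the arrays you
# index are actually read from disk
with np.load('test.npz') as npz:
    print(npz.files)
    ADmat_back = npz['ADmat']
    ts_back = npz['ts']
print(ADmat_back.shape, ts_back)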


savemat (v5 matlab, from scipy)


In [21]:
%%time
savemat('test.mat', vars)


CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 6.55 ms

In [22]:
ls -lstr test.mat


4924 -rw-r--r-- 1 mark mark 5040072 Jun 29 16:08 test.mat

this is a fast way to save data in a format that is easily read into and out of Matlab/Octave

this format is much faster to write than H5, but it is not compressed by default and requires a slow (and complex) library (R.matlab) to be read into R
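
If file size matters more than write speed, savemat can also zlib-compress the variables; a minimal sketch (assuming scipy's do_compression flag, which is off by default):

# trade write speed for a smaller .mat file
savemat('test_compressed.mat', vars, do_compression=True)
mat_back = loadmat('test_compressed.mat')
print(mat_back['ADmat'].shape)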


clean up


In [23]:
ls -lstr


total 18124
5144 -rw-r--r-- 1 mark mark 5264642 Jun 29 15:56 test.pkl
1588 -rw-r--r-- 1 mark mark 1626060 Jun 29 15:59 test.h5
1544 -rw-r--r-- 1 mark mark 1578914 Jun 29 16:03 test_hdf5storage.h5
4924 -rw-r--r-- 1 mark mark 5040524 Jun 29 16:05 test.npz
4924 -rw-r--r-- 1 mark mark 5040072 Jun 29 16:08 test.mat

In [24]:
rm test*.*

In [25]:
ls

feather and bloscpack are other options; bloscpack offers very fast compression, but its GitHub page cautions about long-term stability; the same is true for feather: it is very fast, but again its GitHub page cautions about stability, and the Python implementation currently does not handle row-to-column conversions